Workshop Recordings can be found on the Learning Resources Site: https://blackboard.qut.edu.au/bbcswebdav/pid-1481644-dt-announcement-rid-58665563_1/courses/MXB107_22se2/_site/videos.html.

In this document, I’ll only write things that are not included in the official Workshop resources (above).

Workshop 4 (23/8/22)

Largest Dice Roll

Imagine everyone in this class is lost in a far-away desert. There is not enough food, so we decide who gets to eat by rolling two dice.

Team A wins if the largest face rolled between the two is 1, 2, 3 or 4.

Team B wins otherwise (if the largest face is 5 or 6).

Let \(A\) be the event that team A wins, and \(B\) the event that team B wins.

1 2 3 4 5 6
1 A A A A B B
2 A A A A B B
3 A A A A B B
4 A A A A B B
5 B B B B B B
6 B B B B B B

By counting, we observe that:

\[ \begin{aligned} Pr(A) &= \dfrac{16}{36} \approx 44.4\% \\ Pr(B) &= \dfrac{20}{36} \approx 55.6\% \end{aligned} \]

We see that team B actually wins more often than team A.
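The counting can be confirmed in R by enumerating all 36 equally likely outcomes (a quick sketch, not part of the workshop solution):

```r
# Enumerate all 36 equally likely rolls of two dice
rolls <- expand.grid(d1 = 1:6, d2 = 1:6)
largest <- pmax(rolls$d1, rolls$d2)

mean(largest <= 4)  # Pr(A) = 16/36
mean(largest >= 5)  # Pr(B) = 20/36
```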

Bayesian: Rare Disease

There is a rare yet deadly disease that affects 0.1% of the population. To diagnose the disease, scientists have made a test kit with 99% accuracy. That is, if the patient indeed has the disease, it will test positive 99% of the time (so if you are sick, there is a 1% chance that it reads negative). It also means that even if you are not sick, it still gives a positive result 1% of the time.

Positive Once

What is the probability that you have the disease given that you tested positive once?

Answer

Let \(S\) be the event that you are sick, and \(P\) be the event that you test positive.

\[ \begin{aligned} Pr(S) &= 0.001 \\ Pr(P | S) &= 0.99 \\ Pr(P | S^c) &= 0.01 \\ Pr(P^c |S) &= 0.01 \end{aligned} \]

We need to find \(Pr(S|P)\). Bayes’ Theorem states that:

\[ \begin{aligned} Pr(S|P) &= \dfrac{Pr(P|S)Pr(S)}{Pr(P)} \\ Pr(P) &= Pr(P|S)Pr(S) + Pr(P|S^c)Pr(S^c) & (\text{Law of total probability}) \\ \\ \text{Substitution:} \\ \Rightarrow Pr(S|P) &= \dfrac{Pr(P|S)Pr(S)}{Pr(P|S)Pr(S) + Pr(P|S^c)Pr(S^c)} \\ &= \dfrac{0.99\times0.001}{0.99\times 0.001 + 0.01\times0.999} \\ &\approx 0.0901\approx9.01\% \end{aligned} \]

Hence, if you tested positive once, the probability that you actually have the disease is 9.01% (which is not that bad!).

Positive Twice*

*This is a challenging question for extension.

Now, you took another test and got a positive result. What is the probability that you have the disease now?

Answer

Renaming the above \(P\) to \(P_1\), we have:

\[ Pr(S|P_1) \approx 0.0901 \] For two tests, we have:

\[ Pr(S|P_2 \cap P_1) = \dfrac{Pr(P_2 \cap P_1 | S)Pr(S)}{Pr(P_2\cap P_1|S)Pr(S) + Pr(P_2 \cap P_1|S^c)Pr(S^c)} \]

Since \(P_1\) and \(P_2\) are conditionally independent given the disease status, we see that

\[ Pr(P_2\cap P_1|S)=Pr(P_2|S)Pr(P_1|S)\\ Pr(P_2\cap P_1|S^c)=Pr(P_2|S^c)Pr(P_1|S^c) \]

Hence,

\[ \begin{aligned} Pr(S|P_2 \cap P_1) &= \dfrac{Pr(P_2 \cap P_1 | S)Pr(S)}{Pr(P_2\cap P_1|S)Pr(S) + Pr(P_2 \cap P_1|S^c)Pr(S^c)}\\ &=\dfrac{Pr(P_2|S)Pr(P_1|S)Pr(S)}{Pr(P_2|S)Pr(P_1|S)Pr(S) + Pr(P_2|S^c)Pr(P_1|S^c)Pr(S^c)}\\ &=\dfrac{0.99\times0.99\times0.001}{0.99\times0.99\times0.001 + 0.01\times0.01\times0.999}\\ &\approx0.9075 \approx 90.75\% \end{aligned} \]

So the new probability is now about 90.75%.
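Both posteriors can be reproduced in R by applying Bayes' Theorem sequentially, feeding the first posterior back in as the new prior (a sketch; the variable names are my own):

```r
# Posterior probability of being sick, given a positive test
sens <- 0.99  # Pr(positive | sick)
fpr  <- 0.01  # Pr(positive | not sick)

posterior <- function(prior) {
  sens * prior / (sens * prior + fpr * (1 - prior))
}

p1 <- posterior(0.001)  # after one positive: ≈ 0.0902
p2 <- posterior(p1)     # after a second positive: ≈ 0.9075
```

Because the tests are conditionally independent given the disease status, updating twice with one test each gives the same answer as the joint two-test formula.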

Workshop 5 (30/8/22)

Sum of Two Dice

Let X be the random variable generated by summing the results of two dice.

a. Find the probability mass function

Remember, the probability mass function gives the probability of each value of X:
1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

\[ p(x)=Pr(X=x), x\in \{2,3,4,5,6,7,8,9,10,11,12\} \]

By counting, we find that

\[ p(x)=\begin{cases} \dfrac{1}{36} & ,x=2,12 \\ \dfrac{2}{36} & ,x=3,11 \\ \dfrac{3}{36} & ,x=4,10 \\ \dfrac{4}{36} & ,x=5,9 \\ \dfrac{5}{36} & ,x=6,8 \\ \dfrac{6}{36} & ,x=7 \\ \end{cases} \]

b. Find the mean

\[ \begin{aligned} E[X] &= \sum_{S}xp(x)\\ &= \sum_{x=2}^{12}xp(x)\\ &= \dfrac{1}{36} \times 2 + \dfrac{2}{36} \times 3 + \dfrac{3}{36} \times 4 + \dfrac{4}{36} \times 5 + \dfrac{5}{36} \times 6 + \dfrac{6}{36} \times 7 \\ &+ \dfrac{5}{36} \times 8 + \dfrac{4}{36} \times 9 + \dfrac{3}{36} \times 10 + \dfrac{2}{36} \times 11 + \dfrac{1}{36} \times 12\\ &= 7 \end{aligned} \]

Hence the mean is 7.

c. Find the median

\[ \text{median of X} = \left\{ x \in S | Pr(X\leq x) \geq \dfrac{1}{2}, Pr(X\geq x) \geq \dfrac{1}{2} \right\} \]

Try \(x = 8\):

\[ \begin{aligned} Pr(X \leq 8) &= Pr(X=2)+Pr(X=3)+Pr(X=4)+\cdots+Pr(X=8)\\ &= \dfrac{1}{36} + \dfrac{2}{36} + \dfrac{3}{36} + \dfrac{4}{36} + \dfrac{5}{36} + \dfrac{6}{36} + \dfrac{5}{36}\\ &\approx 0.72 \geq \dfrac{1}{2}\\ Pr(X \geq 8) &= Pr(X=8)+Pr(X=9)+Pr(X=10)+Pr(X=11)+Pr(X=12)\\ &= \dfrac{5}{36} + \dfrac{4}{36} + \dfrac{3}{36} + \dfrac{2}{36} + \dfrac{1}{36} \\ &\approx 0.42 < \dfrac{1}{2} \end{aligned} \]

Hence \(x = 8\) is not the median of X.

Try \(x = 7\):

\[ \begin{aligned} Pr(X \leq 7) &= Pr(X=2)+Pr(X=3)+Pr(X=4)+\cdots+Pr(X=7)\\ &= \dfrac{1}{36} + \dfrac{2}{36} + \dfrac{3}{36} + \dfrac{4}{36} + \dfrac{5}{36} + \dfrac{6}{36}\\ &\approx 0.58 \geq \dfrac{1}{2}\\ Pr(X \geq 7) &= Pr(X=7)+Pr(X=8)+Pr(X=9)+Pr(X=10)+Pr(X=11)+Pr(X=12)\\ &= \dfrac{6}{36} + \dfrac{5}{36} + \dfrac{4}{36} + \dfrac{3}{36} + \dfrac{2}{36} + \dfrac{1}{36} \\ &\approx 0.58 \geq \dfrac{1}{2} \end{aligned} \]

Hence \(x = 7\) is the median of X.

d. Find the mode

\[ \text{mode of X} = \arg\max_{x \in S}p(x) \]

In this case, we know

\[ S = \{2,3,4,5,6,7,8,9,10,11,12\} \]

and based on the \(p(x)\) defined above, the most common sum is 7, where \(p(7) = \dfrac{6}{36}\).

Hence the mode is 7.

Since mean = median = mode, we say \(X\) is defined by a symmetric, uni-modal distribution (i.e. no skew).

e. Find the variance and standard deviation

\[ \begin{aligned} Var[X] &= E[(X-\mu)^2]\\ &= \sum_S(x-\mu)^2p(x)\\ &= \sum_{x=2}^{12}(x-7)^2p(x)\\ &= (2-7)^2 \times \dfrac{1}{36} + (3-7)^2 \times \dfrac{2}{36} + (4-7)^2 \times \dfrac{3}{36} + \cdots + (11-7)^2 \times \dfrac{2}{36} + (12-7)^2 \times \dfrac{1}{36}\\ &\approx 5.83\\ \sigma &\approx \sqrt{5.83} \approx 2.41 \end{aligned} \]

Hence the variance of X is 5.83 and the standard deviation is 2.41.
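All of parts (a)–(e) can be cross-checked in R by enumerating the 36 outcomes (a quick sketch):

```r
# X = sum of two dice, enumerated over all 36 equally likely outcomes
sums <- rowSums(expand.grid(d1 = 1:6, d2 = 1:6))
pmf  <- table(sums) / 36          # matches p(x) above
x    <- as.numeric(names(pmf))

mu     <- sum(x * pmf)            # mean: 7
sigma2 <- sum((x - mu)^2 * pmf)   # variance: 35/6 ≈ 5.83
sqrt(sigma2)                      # standard deviation ≈ 2.41
```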

Workshop 6 (6/9/22)

Cumulative Probability Distribution

Given a PDF:

\[ f(x) = \dfrac{3x^2}{8}, 0<x<2 \]

Then the CDF is:

\[ \begin{aligned} F(x) &= \int_{-\infty}^{x}f(u)du\\ &= \int_{-\infty}^{0}f(u)du + \int_{0}^xf(u)du\\ &= 0 + \int_0^xf(u)du\\ &= \int_0^x\dfrac{3u^2}{8}du\\ &= \left[\dfrac{3u^3}{3\times8}\right]_0^x\\ &= \dfrac{x^3}{8} - 0\\ &= \dfrac{x^3}{8} \end{aligned} \]

Check that the CDF evaluated at the upper limit is 1.

\[ F(2) = \dfrac{2^3}{8} = 1 \]
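The same check can be done numerically in R with integrate:

```r
# The PDF should integrate to 1 over its support (0, 2)
f <- function(x) 3 * x^2 / 8
integrate(f, lower = 0, upper = 2)$value  # ≈ 1
integrate(f, lower = 0, upper = 1)$value  # F(1) = 1^3/8 = 0.125
```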

Butter weight

Example: A dairy produces packages of butter using a machine that is set to produce 250 g blocks of butter with a standard deviation of 10 g. A sample of size n=13 blocks of butter produces an average weight of \(\bar{x}\)=253 g. What is the probability of observing a sample average weight of 253 g or more?

Set up the solution by hand and use R to compute the probability.

Answer

What we know:

\[ \mu = 250g\\ \sigma = 10g\\ n=13 \text{ blocks} \] Apply the central limit theorem:

\[ \bar{x} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)\\ \bar{x} \sim N\left(250, \dfrac{10^2}{13}\right)\\ \] Here, the first parameter of the normal distribution is the mean and the second is the variance.

And the standard deviation of the distribution of means is \[ \begin{aligned} s &= \sqrt{\dfrac{\sigma^2}{n}} \\ &= \dfrac{\sigma}{\sqrt{n}} \\ &= \dfrac{10}{\sqrt{13}} \end{aligned} \]

To convert from any random variable to Z score:

\[ Z = \dfrac{X - \mu}{s} \]

Hence:

\[ \begin{aligned} Pr(\bar{x} \geq 253) &= Pr\left(\dfrac{\bar{x} - \mu}{s} \geq \dfrac{253 - \mu}{s}\right)\\ &= Pr\left(Z \geq \dfrac{253 - \mu}{s}\right)\\ &= Pr\left(Z \geq \dfrac{253 - \mu}{\sigma / \sqrt{n}}\right)\\ &= Pr\left(Z \geq \dfrac{253-250}{10/\sqrt{13}}\right)\\ &= Pr\left(Z \geq 1.08167\right) \end{aligned} \]

For a number \(t\), the lookup table gives \(Pr(Z < t)\). We want \(Pr(Z \geq t) = 1 - Pr(Z < t)\).

From lookup table:

\[ Pr(Z < 1.08) \approx 0.85993\\ Pr(Z > 1.08) \approx 1 - 0.85993\approx0.14007 \]

pnorm(253, mean = 250, sd = 10/sqrt(13), lower.tail = FALSE)
## [1] 0.1397006
pnorm(1.08167, mean = 0, sd = 1, lower.tail = FALSE)
## [1] 0.1396996

Therefore, the probability of observing a sample average weight of 253g or more is approximately 0.14.

Online Workshop Preference

Example

Assume that in Semester 2, from a sample of 100 QUT students living in the Greater Brisbane Area, 47 replied that they preferred online workshops. Results of previous surveys from Semester 1 indicated that 35% of students preferred online workshops. What is the probability of observing the Semester 2 proportion or more, given that the Semester 1 results are accurate?

The central limit theorem for a sample proportion states that: \[ \hat{p} \sim N\left(p,\dfrac{p(1-p)}{n}\right) \]

Given:

\[ n = 100\\ \hat{p} = \dfrac{x}{n} = \dfrac{47}{100}= 0.47\\ \]

Let \(\hat{p}\) be the sample proportion of students who prefer online workshops in Semester 2.

Under the Semester 1 proportion \(p = 0.35\), the variance of \(\hat{p}\) is \(\dfrac{p(1-p)}{n}=\dfrac{0.35\times(1-0.35)}{100}=0.002275\), and the standard deviation of \(\hat{p}\) is \(s = \sqrt{0.002275} \approx 0.0477\).

\[ \begin{aligned} Pr(\hat{p} > 0.47) &= Pr\left(\dfrac{\hat{p} - 0.35}{\sqrt{0.002275}} > \dfrac{0.47 - 0.35}{\sqrt{0.002275}}\right)\\ &= Pr(Z > 2.516)\\ &\approx 0.006 \end{aligned} \]

pnorm(2.5, lower.tail = FALSE) # using the rounded Z score 2.5
## [1] 0.006209665

Therefore, the probability of observing the proportion from Semester 2 (0.47) or more, given that the results from Semester 1 are accurate, is approximately 0.006 or 0.6%.

Workshop 7 (13/9/22)

Method of Moments

Find the method of moments estimator for the exponential distribution where

\[ f(x) = \lambda e^{-\lambda x} \]

The first moment of the sample is the sample mean, which we equate to the first moment of the distribution:

\[ m_1 = E[X] \]

where \(X\) follows an exponential distribution. We can find \(E[X]\) in terms of \(\lambda\), then solve for \(\lambda\).

\[ E[X] = \int_{0}^{\infty} x\lambda e^{-\lambda x} dx = \dfrac{1}{\lambda} \]

\[ m_1 = \dfrac{1}{\lambda} \\ \lambda = \dfrac{1}{m_1} \] Because we need an estimate based on the sample mean rather than the exact population mean, change the notation accordingly:

\[ \lambda \rightarrow \tilde{\lambda}\\ m_1 \rightarrow \bar{x} \]

Therefore, the method of moments estimator for the exponential distribution is:

\[ \tilde{\lambda} = \dfrac{1}{\bar{x}} \]

Method of Maximum Likelihood Estimation

\[ f(x) = \lambda e^{-\lambda x} \]

This gives the likelihood function:

\[ \begin{aligned} L(\lambda|\mathbf{x}) &= \prod_{i=1}^{n}f(x_i) \\ &=\lambda e^{-\lambda x_1}\cdot\lambda e^{-\lambda x_2}\cdots \lambda e^{-\lambda x_n} \end{aligned} \]

The log likelihood function:

\[ \begin{aligned} l(\lambda|\mathbf{x}) &= log(L(\lambda|\mathbf{x})) \\ &= log(\lambda e^{-\lambda x_1}\lambda e^{-\lambda x_2}\cdots \lambda e^{-\lambda x_n})\\ &= log(\lambda e^{-\lambda x_1}) + log(\lambda e^{-\lambda x_2}) + \cdots + log(\lambda e^{-\lambda x_n})\\ &= \sum_{i=1}^{n}log(\lambda e^{-\lambda x_i})\\ &= \sum_{i=1}^{n}(log(\lambda) + log(e^{-\lambda x_i}))\\ &= \sum_{i=1}^{n}(log(\lambda) -\lambda x_i)\\ &= n\log(\lambda) - \lambda \sum_{i=1}^{n}x_i \end{aligned} \]

Find the derivative of the log likelihood function:

\[ \begin{aligned} \dfrac{d}{d\lambda}l(\lambda|\mathbf{x}) &= \dfrac{d}{d\lambda}\left(n\log(\lambda)\right) - \dfrac{d}{d\lambda}\left(\lambda\sum_{i=1}^{n}x_i\right) \\ &= \dfrac{n}{\lambda} - \sum_{i=1}^{n}x_i \end{aligned} \]

Set the derivative to 0 and find \(\lambda\).

\[\begin{aligned} n \dfrac{1}{\lambda} - \sum_{i=1}^{n}x_i &= 0\\ \dfrac{n}{\lambda} &= \sum_{i=1}^{n}x_i\\ \lambda &= \dfrac{n}{\sum_{i=1}^{n}x_i}\\ \lambda &= \dfrac{1}{\dfrac{\sum_{i=1}^{n}x_i}{n}}\\ \lambda &= \dfrac{1}{\bar{x}} \end{aligned}\]

Because we are finding an estimate, \(\lambda \rightarrow \hat{\lambda}\):

\[ \hat{\lambda} = \dfrac{1}{\bar{x}} \]
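As a numerical sanity check (not part of the derivation), we can maximise the log likelihood directly on simulated data and compare with \(1/\bar{x}\):

```r
# Simulate exponential data and maximise the log likelihood numerically
set.seed(1)
x <- rexp(500, rate = 2)

loglik <- function(lambda) length(x) * log(lambda) - lambda * sum(x)
lambda_hat <- optimize(loglik, interval = c(0.01, 10), maximum = TRUE)$maximum

lambda_hat   # numerical maximiser
1 / mean(x)  # closed-form MLE; the two agree closely
```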

Bonus: Finding log likelihood of Poisson

\[\begin{aligned} L(\lambda|\mathbf{x}) &= \prod_{i=1}^{n}\dfrac{\lambda^{x_i}}{x_i!}e^{-\lambda} \\ l(\lambda|\mathbf{x}) &= \sum_{i=1}^{n}log\left(\dfrac{\lambda^{x_i}}{x_i!}e^{-\lambda}\right)\\ &= \sum_{i=1}^{n}\left(log(\lambda ^{x_i}) + log\left(\dfrac{1}{x_i!}\right) + log(e^{-\lambda})\right)\\ &= \sum_{i=1}^{n}(x_ilog(\lambda) - log(x_i!) - \lambda)\\ &= log(\lambda)\sum_{i=1}^{n}x_i- \sum_{i=1}^{n}log(x_i!)- n\lambda \end{aligned}\]

Z-score transformation

\[ Z = \dfrac{X-\mu}{\sigma} \\ Z\sigma=X - \mu \]

Normal distribution of means from central limit theorem:

\[ \bar{x} \sim N\left(\mu, \dfrac{\sigma^2}{n}\right)\\ Z\dfrac{\sigma}{\sqrt{n}} = \bar{x} - \mu\\ \bar{x} = Z\dfrac{\sigma}{\sqrt n} + \mu \]

For a type I error rate of \(\alpha\), the critical value is denoted \(Z_\alpha\) and satisfies:

\[ Pr(Z > Z_\alpha) = \alpha \]
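In R, these critical values come from qnorm:

```r
# Critical Z values satisfying Pr(Z > Z_alpha) = alpha
qnorm(1 - 0.05)    # one-sided, alpha = 0.05: ≈ 1.645
qnorm(1 - 0.05/2)  # two-sided, alpha = 0.05: ≈ 1.960
```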

A sample of n=50 adult men reveals that their average daily intake of protein is x¯=75.6 grams per day with a standard deviation of s=3.5 grams. Construct a 95% confidence interval for the average daily intake of protein for men.

The confidence interval is

$$
\begin{aligned}
n &= 50 \\
\bar{x} &= 75.6 \\
s &= 3.5 \\
\alpha &= 0.05 \\
\bar{x} &\pm Z_{\alpha/2}SE_{\bar{x}} \\
SE_{\bar{x}} &= \dfrac{s}{\sqrt{n}} = \dfrac{3.5}{\sqrt{50}} \approx 0.495 \\
Z_{\alpha/2} &= Z_{0.05/2} = 1.96 \\
\bar{x} &\pm 1.96\times\dfrac{3.5}{\sqrt{50}} \\
75.6 &\pm 0.97 \\
&(74.63, 76.57)
\end{aligned}
$$

The 95% confidence interval is (74.63, 76.57).
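The same interval can be computed directly in R (a sketch using the values given above):

```r
# 95% CI for mean protein intake
x_bar <- 75.6
s     <- 3.5
n     <- 50
se    <- s / sqrt(n)

x_bar + c(-1, 1) * qnorm(0.975) * se  # ≈ (74.63, 76.57)
```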

A random sample of n=985 Queensland residents was surveyed for their opinions on how the Queensland State Government was handling the COVID-19 crisis. Of those surveyed, x=592 indicated that they approved of the current government’s handling of the COVID-19 crisis. Construct a 90% confidence interval for the proportion of the population that approves of the current government’s handling of the COVID-19 crisis.

$$
\begin{aligned}
n &= 985 \\
x &= 592 \\
\hat{p} &= \dfrac{x}{n} = \dfrac{592}{985} \approx 0.60 \\
\alpha &= 1 - 0.9 = 0.1 \quad (\text{90\% confidence interval}) \\
SE_{\hat{p}} &= \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\dfrac{0.60\times0.40}{985}} \approx 0.0156 \\
Z_{\alpha/2} &= Z_{0.1/2} \approx 1.645 \\
\hat{p} &\pm Z_{\alpha/2}SE_{\hat{p}} \\
0.60 &\pm 1.645 \times 0.0156 \\
0.60 &\pm 0.0256 \\
&(0.574, 0.626)
\end{aligned}
$$

0.6 + 1.645 * 0.0156
## [1] 0.625662

Practical Question 2

Find the confidence interval for the mean of the IMDB.Ranking for Star Trek (The Original Series).

library(MXB107)
data(episodes)

original_episodes <- episodes %>% filter(Series.Name == "The Original Series")

# Stating your variables
x_bar <- original_episodes$IMDB.Ranking %>% mean()
n <- nrow(original_episodes)
s <- original_episodes$IMDB.Ranking %>% sd()

# Find Standard Error
se <- s / sqrt(n)
print(se)
## [1] 0.08987058
# Find critical Z score
alpha <- 0.05
Z_alpha_half <- qnorm(1 - alpha/2)

# Apply confidence interval formula
upper_limit <- x_bar + Z_alpha_half * se
lower_limit <- x_bar - Z_alpha_half * se

# Display results
lower_limit
## [1] 7.360107
upper_limit
## [1] 7.712393
# Alternative approach using summarise() and pull()
params <- episodes %>%
  filter(Series == "TOS") %>%
  summarise(xbar = mean(IMDB.Ranking), std_dev = sd(IMDB.Ranking), N = n())

xbar <- pull(params,xbar)
s <- pull(params,std_dev)
n <- pull(params,N)

SE<-s/sqrt(n)

UCL<-xbar+1.96*SE

LCL<-xbar-1.96*SE

LCL
## [1] 7.360104
UCL
## [1] 7.712396

Workshop 8 (20/9/22)

Online Shopping Example

The quality of a product can be reflected by their star ratings.

When browsing for a new pair of Airbuds on JB Hifi, I found two choices with the following rating distributions.

Sources: https://www.jbhifi.com.au/products/lg-tone-free-fp9a-wireless-anc-in-ear-headphones-with-plug-play#reviews https://www.jbhifi.com.au/products/sony-wf-1000xm4-truly-wireless-noise-cancelling-in-ear-headphones-black#reviews

earphones <- data.frame(
  "Ratings" = 1:5,
  "Sony" = c(67, 48, 94, 273, 759),
  "LG" = c(4,5,3,25,109)
)

earphones %>%
  pivot_longer(c("Sony","LG")) %>%
  ggplot(aes(x = Ratings, y = value)) +
  geom_col() +  # geom_col() draws bars of the given heights directly
  geom_text(aes(label = value), vjust = -0.25) +
  facet_wrap(~name) +
  xlab("Rating") +
  ylab("Number of customers") +
  ggtitle("Customer Ratings on JB Hifi for LG and Sony Airbuds")

Some of their important statistics can also be calculated:

alpha <- 0.05
n_lg <- sum(earphones$LG)
n_sony <- sum(earphones$Sony)

averages <- earphones %>% summarise_at(c("Sony","LG"), list(average = ~ sum(Ratings * .) / sum(.)))
lg_bar <- averages$LG_average
lg_s <- sum((earphones$Ratings - lg_bar)^2 * earphones$LG) / (sum(earphones$LG) - 1)

sony_bar <-averages$Sony_average
sony_s <- sum((earphones$Ratings - sony_bar)^2 * earphones$Sony) / (sum(earphones$Sony) - 1)

Z_critical <- qnorm(1 - alpha)
cat("LG Size:\t", n_lg, "\n")
## LG Size:  146
cat("LG Average:\t", lg_bar, "\n")
## LG Average:   4.575342
cat("LG sd:\t\t", lg_s, "\n")
## LG sd:        0.8253188
cat("Sony Size:\t", n_sony, "\n")
## Sony Size:    1241
cat("Sony Average:\t", sony_bar, "\n")
## Sony Average:     4.296535
cat("Sony sd:\t", sony_s, "\n")
## Sony sd:  1.241028

Which one would you pick?

Most of us lean towards the Sony one, which has more ratings but a lower average.

Go to www.menti.com and use the code 2380 3183

Are they both good?

It is generally believed that a good product should have an average rating of at least 4.25 (or 85%). Given the reviews above, is there enough evidence to conclude that both Airbuds are good products, given a Type I error rate of 0.05?

Since we want to know whether a product has an average rating greater than 4.25 stars, this should be our alternate hypothesis.

\[ H_A: \mu > 4.25 \]

That means our null hypothesis is:

\[ H_0: \mu \leq 4.25 \]

For a type I error rate of \(\alpha = 0.05\), the critical value is:

alpha <- 0.05
Z_critical <- qnorm(1 - alpha) # Since this is a one sided test
Z_critical
## [1] 1.644854

\[ Z_{\alpha} = Z_{0.05} \approx 1.645 \]

Find the test statistic:

  1. For the Sony earbuds:

Sony Size: 1241 Sony Average: 4.296535 Sony sd: 1.114014

\[ \begin{aligned} n &= 1241\\ \bar{x} &= 4.297\\ s &= 1.114\\ \mu_0 &= 4.25 \\ Z &= \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}\\ &= \dfrac{4.297 - 4.25}{1.114/\sqrt{1241}}\\ &\approx 1.486 \end{aligned} \]

Check if the test statistic is greater than the critical value (because the null hypothesis has the form \(H_0: \mu \leq \mu_0\)).

\[ Z > Z_{\alpha}?\\ Z = 1.486 < 1.645 \]

The test statistic does not exceed the critical value, so we fail to reject the null hypothesis.

In other words, we cannot conclude that the Sony earbuds are a good product.

(4.297 - 4.25)/(1.114/sqrt(1241))
  2. For the LG earbuds:

LG Size: 146 LG Average: 4.575342 LG sd: 0.9084706

\[ \begin{aligned} n &= 146\\ \bar{x} &= 4.575\\ s &= 0.908\\ \mu_0 &= 4.25\\ Z_\alpha &= 1.645\\ Z &= \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}\\ &= \dfrac{4.575 - 4.25}{0.908/\sqrt{146}}\\ &\approx 4.325 > Z_{\alpha} \end{aligned} \]

We reject the null hypothesis and conclude that the LG earbuds are a good product.

(4.575 - 4.25) / (0.908/sqrt(146))

Which pair is better, statistically?

Perform a test to check if the LG buds are statistically better than the Sony buds.

Define our alternate hypothesis as:

\[ H_A: \mu_{LG} > \mu_{Sony}\\ \text{i.e. } \mu_{LG} - \mu_{Sony} > 0 \]

That means our null hypothesis is:

\[ H_0: \mu_{LG} - \mu_{Sony} \leq 0 \]

This is a one sided test.

LG Size: 146 LG Average: 4.575342 LG sd: 0.9084706 Sony Size: 1241 Sony Average: 4.296535 Sony sd: 1.114014

\[ \begin{aligned} \Delta_0 &= 0 \\ \bar{x}_{lg} &= 4.575\\ s_{lg} &= 0.908\\ n_{lg} &= 146\\ \bar{x}_{sony} &= 4.297\\ s_{sony} &= 1.114\\ n_{sony} &= 1241\\ \end{aligned} \]

Since \(\alpha = 0.05\), the critical Z score is the same as above:

\[ Z_\alpha = 1.645 \]

\[ \begin{aligned} Z &= \dfrac{(\bar{x}_{lg} - \bar{x}_{sony})-\Delta_0}{\sqrt{\dfrac{s_{lg}^2}{n_{lg}}+\dfrac{s_{sony}^2}{n_{sony}}}}\\ &= \dfrac{(4.575 - 4.297) - 0}{\sqrt{\dfrac{0.908^2}{146}+\dfrac{1.114^2}{1241}}}\\ &\approx 3.410 \end{aligned} \]

num <- 4.575 - 4.297
den <- sqrt(0.908^2/146 + 1.114^2/1241)
num/den

Test if the test statistic is greater than the critical value (\(Z_{\alpha} = 1.645\)).

\[ Z > Z_\alpha\\ \text{since}\\ 3.410 > 1.645 \]

Hence, we reject the null hypothesis and conclude that the LG buds are rated better than the Sony buds.

Workshop 9 (4/10/22)

Example 1 Synthetic Diamonds

A new process for producing synthetic diamonds is only profitable if the average weight of the diamonds produced is greater than 0.5 carats. Six samples from the new process are created weighing: (0.46,0.61,0.52,0.48,0.57,0.54) carats. Set up a hypotheses test to determine whether or not the process is profitable. Assume a Type I error rate of α=0.05.

The alternate hypothesis is “the average weight of the diamonds produced is greater than 0.5 carats”.

\[ H_A: \mu > 0.5\\ H_0: \mu \leq 0.5\\ \]

Given variables:

\[ \begin{aligned} \alpha &= 0.05\\ n &= 6\\ \end{aligned} \] Find mean \(\bar{x}\) and standard deviation \(s\):

\[ \begin{aligned} \bar{x} &= \dfrac{0.46+0.61+0.52+0.48+0.57+0.54}{6}\\ &= 0.53\\ s &= \sqrt{\dfrac{\sum(x-\bar{x})^2}{n-1}}\\ &= \sqrt{\dfrac{(0.46 - 0.53)^2+(0.61-0.53)^2+(0.52-0.53)^2+(0.48-0.53)^2+(0.57-0.53)^2+(0.54-0.53)^2}{6 - 1}}\\ &\approx 0.056 \end{aligned} \]

(Or find the mean and standard deviation using R)

x <- c(0.46,0.61,0.52,0.48,0.57, 0.54)
mean(x)
## [1] 0.53
sd(x)
## [1] 0.05585696

The degrees of freedom are \(\nu=n -1= 6 - 1 = 5\). Therefore, the critical test statistic is \(t_{\nu,\alpha} = t_{5, 0.05} = 2.015\) (using the t distribution table).

(Or confirm using R)

alpha <- 0.05
nu <- 5
t_critical <- qt(1 - alpha, 5)
t_critical
## [1] 2.015048

And the test statistic is \(T = \sqrt{n}\dfrac{\bar{x}-\mu_0}{s}\).

\[ \begin{aligned} T &= \sqrt{6}\dfrac{0.53-0.5}{0.056}\\ &\approx 1.316 < t_{5, 0.05} \end{aligned} \] Since the test statistic is less than the critical value (in the case of \(H_0:\mu \leq \mu_0\), we check whether \(T > t_{\nu,\alpha}\)), we fail to reject the null hypothesis. Hence we conclude that there is not enough evidence to statistically confirm that the diamond-making process is profitable.
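R's built-in t.test runs the same one-sided test in one line and agrees with the hand calculation:

```r
# One-sided t test: H_A is mu > 0.5
x <- c(0.46, 0.61, 0.52, 0.48, 0.57, 0.54)
res <- t.test(x, mu = 0.5, alternative = "greater")
res  # T ≈ 1.316, p ≈ 0.12 > 0.05, so we fail to reject H_0
```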

Example 4: Unequal variances

Data are collected to measure the impact strength of two different packaging materials

df <- tibble('Material 1' = c(1.25,1.16,1.33,1.15,1.23,1.20,1.32,1.28,1.21),
              'Material 2' = c(0.89,0.91,0.97,0.95,0.94,0.92,0.98,0.96,0.98))

Is there a difference between the strengths of the two packing materials? Test this using the appropriate hypotheses at the Type I Error Rate of α=0.05.

Our alternate hypothesis is that there is a difference:

\[ H_A: \mu_1 \neq \mu_2\\ H_A: \mu_1 -\mu_2 \neq \Delta_0 \]

Where in our case, \(\Delta_0 = 0\).

\[ \alpha = 0.05\\ n_1 = n_2 = 9 \]

Find the mean and standard deviation using R:

stats <- df %>% summarise_at(c("Material 1", "Material 2"), .funs = list(mean = ~mean(.), sd = ~sd(.)))
stats

\[ \bar{x_1} \approx 1.24\\ \bar{x_2} \approx 0.94\\ s_1 \approx 0.064\\ s_2 \approx 0.032\\ \]

Check if we need to use unequal variances:

\[ \dfrac{s_1^2}{s_2^2} > 3?\\ \dfrac{0.064^2}{0.032^2} = 4 > 3 \]

Since the larger sample variance is more than 3 times the smaller, we treat the variances as unequal.

Note: If the two variances are equal, use \(\nu = n_1 + n_2 - 2\). Otherwise, use the below formula for unequal variance.

Hence the degree of freedom (for the t test is):

\[ \nu \approx \dfrac{(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2})^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1} + \dfrac{(s_2^2/n_2)^2}{n_2-1}}\\ \nu = \dfrac{(\dfrac{0.064^2}{9} + \dfrac{0.032^2}{9})^2}{\dfrac{(0.064^2/9)^2}{9-1} + \dfrac{(0.032^2/9)^2}{9-1}}\\ \]

x_1_bar <- stats$`Material 1_mean`
x_2_bar <- stats$`Material 2_mean`
s_1 <- stats$`Material 1_sd`
s_2 <- stats$`Material 2_sd`
n_1 <- 9
n_2 <- 9

num <- (s_1^2/n_1 + s_2^2/n_2)^2
den <- (s_1^2/n_1)^2 / (n_1 - 1) + (s_2^2/n_2)^2 / (n_2 - 1)

nu <- num / den
nu
## [1] 11.73352

We found that \(\nu \approx 11.73\).

Find the critical test statistics using R:

alpha <- 0.05
t_critical <- qt(1 - alpha / 2, nu) # alpha / 2 because we are using a two-sided test
t_critical
## [1] 2.184314

So the critical test statistic is \(t_{\nu,\alpha/2} \approx 2.18\). We reject the null hypothesis if \(|T| > t_{\nu,\alpha/2}\).

Find the test statistics:

\[ \begin{aligned} T &= \dfrac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{s_p}\\ s_p &= \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}\\ &= \sqrt{\dfrac{0.064^2}{9} + \dfrac{0.032^2}{9}}\\ &\approx 0.024 \\ T &= \dfrac{1.24 - 0.94}{0.024}\\ &\approx 12.5 > 2.18 \end{aligned} \]

Hence, we reject the null hypothesis and conclude that there is a statistically significant difference between the two materials.

sqrt(0.064^2/9 + 0.032^2/9)
## [1] 0.02385139
(1.24 - 0.94) / 0.024
## [1] 12.5
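R's t.test performs this Welch test directly (var.equal = FALSE is the default); its statistic uses the unrounded means, so it differs slightly from the hand-rounded value:

```r
# Welch two-sample t test for the packaging materials
m1 <- c(1.25, 1.16, 1.33, 1.15, 1.23, 1.20, 1.32, 1.28, 1.21)
m2 <- c(0.89, 0.91, 0.97, 0.95, 0.94, 0.92, 0.98, 0.96, 0.98)

res <- t.test(m1, m2, var.equal = FALSE)
res  # df ≈ 11.7, T ≈ 12.2, p far below 0.05
```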

Example 5: Paired Difference

We test the difference between two types of tyre (Type A and Type B) by selecting one tyre of each type at random and then fitting them to the rear wheels of one of five cars. Each car is driven for a specified number of kilometres, and the observed wear is measured.

df <- data.frame(A = c(10.6,9.8,12.3,9.7,8.8), B = c(10.2,9.4,11.8,9.1,8.3))

\[ \alpha = 0.05\\ n = 5\\ \]

Since each row of measurements comes from the same car, the two groups of measurements are not independent, so we use the paired difference.

diff <- data.frame(Difference = df$A - df$B)
diff

Form the hypotheses:

\[ H_A: \mu_d \neq 0 \\ H_0: \mu_d = 0 \]

Then we can find the mean and standard deviation of the difference (either by hand or using R):

diff %>% summarise_at(c("Difference"), .funs = list(mean = ~mean(.), sd = ~sd(.)))

\[ \bar{d} = 0.48\\ s = 0.084\\ \]

The sample is small, so we need to use the t-distribution, with degrees of freedom \(\nu = n - 1=5-1=4\). Hence the critical test statistic (using the t table):

\[ T_{\nu, \alpha/2} = T_{4, 0.025} \approx 2.776 \]

And we would reject the null hypothesis if \(|T| > t_{\nu, \alpha/2}\). The test statistic is:

\[ T = \sqrt{n}\dfrac{\bar{d} - 0}{s} = \sqrt{5}\dfrac{0.48}{0.084} \approx 12.78 > 2.776 \] Since \(|T| > t_{\nu, \alpha/2}\), we reject the null hypothesis and conclude that there is enough evidence of a difference between the two tyres.
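The paired test is also available directly via t.test with paired = TRUE:

```r
# Paired t test for the tyre wear data
A <- c(10.6, 9.8, 12.3, 9.7, 8.8)
B <- c(10.2, 9.4, 11.8, 9.1, 8.3)

res <- t.test(A, B, paired = TRUE)
res  # T ≈ 12.83 on 4 degrees of freedom, p well below 0.05
```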

Notes: Confidence interval for the difference in sample means

\[ \bar{x} - \bar{y} \pm T_{\nu,\alpha/2} \sqrt{s_p^2\left(\dfrac{1}{n_x} + \dfrac{1}{n_y}\right)} \]

Workshop 10 (11/10/22)

y <- c(4.13,4.07,4.04,4.07,4.05,4.04,4.02,4.06,4.10,4.04,
       3.86,3.85,4.08,4.11,4.08,4.01,4.02,4.04,3.97,3.95,
       4.00,4.02,4.01,4.01,4.04,3.99,4.03,3.97,3.98,3.98,
       3.88,3.88,3.91,3.95,3.92,3.97,3.92,3.90,3.97,3.90,
       4.02,3.95,4.02,3.89,3.91,4.01,3.89,3.89,3.99,4.00,
       4.02,3.86,3.96,3.97,4.00,3.82,3.98,3.99,4.02,3.93,
       4.00,4.02,4.03,4.04,4.10,3.81,3.91,3.96,4.05,4.06)
labs <- c(rep("Lab 1", 10),rep("Lab 2", 10), rep("Lab 3",10), rep("Lab 4",10),
          rep("Lab 5",10), rep("Lab 6",10),rep("Lab 7",10))

df <- tibble(y = y, lab = labs)
df %>%
  ggplot(aes(x = lab, y = y)) +
  geom_boxplot()

df
sd(y)
## [1] 0.07184294
model <- lm(y ~ lab, data = df)
panderOptions('missing',"")
model%>%aov()%>%pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
lab 6 0.1247 0.02079 5.66 9.453e-05
Residuals 63 0.2314 0.003673
total_variance <- var(df$y)
total_variance
## [1] 0.005161408
lab_1_variance <- (df%>%filter(lab=="Lab 1"))$y%>%var()
lab_1_variance
## [1] 0.001062222
alpha <- 0.05
F_critical <- qf(1 - alpha, 6, 63)
F_critical
## [1] 2.246408

According to the ANOVA result table, the test F statistic is 5.66, which is greater than the critical F statistic (being approximately 2.25). Therefore, we reject the null hypothesis that all the lab means are the same.

We conclude that the means of some labs are different from the others.

0.02079 / 0.003673 # Calculating F statistic treatment mean squared error / residuals mean squared error
## [1] 5.660223
1 - pf(5.66, 6, 63)
## [1] 9.453964e-05

Tukey Honest Significant Difference (HSD) Test

anova_model <- model%>%aov()
tukey_model <- anova_model%>%TukeyHSD()

# Round the result to look nicer
tukey_model$lab%>%round(3)
##               diff    lwr    upr p adj
## Lab 2-Lab 1 -0.065 -0.148  0.018 0.217
## Lab 3-Lab 1 -0.059 -0.142  0.024 0.323
## Lab 4-Lab 1 -0.142 -0.225 -0.059 0.000
## Lab 5-Lab 1 -0.105 -0.188 -0.022 0.005
## Lab 6-Lab 1 -0.107 -0.190 -0.024 0.004
## Lab 7-Lab 1 -0.064 -0.147  0.019 0.232
## Lab 3-Lab 2  0.006 -0.077  0.089 1.000
## Lab 4-Lab 2 -0.077 -0.160  0.006 0.083
## Lab 5-Lab 2 -0.040 -0.123  0.043 0.758
## Lab 6-Lab 2 -0.042 -0.125  0.041 0.714
## Lab 7-Lab 2  0.001 -0.082  0.084 1.000
## Lab 4-Lab 3 -0.083 -0.166  0.000 0.048
## Lab 5-Lab 3 -0.046 -0.129  0.037 0.620
## Lab 6-Lab 3 -0.048 -0.131  0.035 0.572
## Lab 7-Lab 3 -0.005 -0.088  0.078 1.000
## Lab 5-Lab 4  0.037 -0.046  0.120 0.818
## Lab 6-Lab 4  0.035 -0.048  0.118 0.853
## Lab 7-Lab 4  0.078 -0.005  0.161 0.076
## Lab 6-Lab 5 -0.002 -0.085  0.081 1.000
## Lab 7-Lab 5  0.041 -0.042  0.124 0.736
## Lab 7-Lab 6  0.043 -0.040  0.126 0.691

According to the Tukey HSD result table, the following pairs are different:

  • Lab 4 and Lab 1
  • Lab 5 and Lab 1
  • Lab 6 and Lab 1
  • Lab 4 and Lab 3

Pr(>F) and hypothesis testing

\[ Pr(F > F_{\alpha}) = \alpha\\ \text{so } F_{test} > F_{\alpha} \iff Pr(F > F_{test}) < \alpha \]

Mobile Phone Usage

df <- tibble(`Usage_Level` = c("Low", "Medium","High"),A = c(27,68,308), B = c(24,76,326), C = c(31,65,312), D = c(23,67,300))
long_df <- df %>% pivot_longer(cols = c("A","B","C","D"))
model_noblocking <- lm(value~name, data = long_df)
anova_model <- model_noblocking%>%aov()
anova_model%>%pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
name 3 222.2 74.08 0.003126 0.9997
Residuals 8 189577 23697
long_df

Using the non-blocking method, we fail to reject the null hypothesis and conclude that the prices are not different between the companies.

model_blocking <- lm(value~Usage_Level + name, data = long_df)
anova_model_blocking <- model_blocking%>%aov()
anova_model_blocking %>% pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
Usage_Level 2 189335 94668 2352 2.067e-09
name 3 222.3 74.08 1.841 0.2404
Residuals 6 241.5 40.25

We see that the Pr(>F) has dropped quite significantly (from 0.9997 to 0.2404), however, since this new value is still much greater than the desired type I error rate of \(\alpha = 0.05\), we fail to reject the null hypothesis and conclude that the prices are not really different between different companies.

long_df

Two-factor experiment

Similar to blocking, but both variables are observed and of interest.

df <- tibble(Supervisor = c(1,1,1,2,2,2)%>%as.factor(), 
             Day = c(571,610,625,480,516,465), 
             Swing = c(480,474,540,625,600,581),
             Night = c(470,430,450,630,680,661))

long_df <- df %>% pivot_longer(cols = c("Day", "Swing", "Night"), names_to = "ShiftTime")
linear_model <- lm(value ~ Supervisor + ShiftTime + Supervisor:ShiftTime, data = long_df)
linear_model %>% aov() %>%pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
Supervisor 1 19208 19208 26.68 0.0002351
ShiftTime 2 247 123.5 0.1715 0.8444
Supervisor:ShiftTime 2 81127 40564 56.34 7.95e-07
Residuals 12 8640 720

Based on the analysis of variance model for the two factors (Supervisor and ShiftTime), we see that Supervisor has a significant effect on production (with a Pr(>F) of 0.0002351), and the interaction Supervisor:ShiftTime also has a significant effect.

Use Tukey HSD to find the paired difference:

aov_model <- linear_model %>% aov()
hsd <- aov_model%>%TukeyHSD()
hsd$"Supervisor:ShiftTime" %>% round(3) %>% pander()
  diff lwr upr p adj
2:Day-1:Day -115 -188.6 -41.41 0.002
1:Night-1:Day -152 -225.6 -78.41 0
2:Night-1:Day 55 -18.59 128.6 0.195
1:Swing-1:Day -104 -177.6 -30.41 0.005
2:Swing-1:Day 0 -73.59 73.59 1
1:Night-2:Day -37 -110.6 36.59 0.563
2:Night-2:Day 170 96.41 243.6 0
1:Swing-2:Day 11 -62.59 84.59 0.995
2:Swing-2:Day 115 41.41 188.6 0.002
2:Night-1:Night 207 133.4 280.6 0
1:Swing-1:Night 48 -25.59 121.6 0.309
2:Swing-1:Night 152 78.41 225.6 0
1:Swing-2:Night -159 -232.6 -85.41 0
2:Swing-2:Night -55 -128.6 18.59 0.195
2:Swing-1:Swing 104 30.41 177.6 0.005

Using the significant rows:

* 2:Day-1:Day -115 -188.6 -41.41 0.002: Supervisor 1 is better than 2 during the day (due to the negative difference 2 - 1)
* 1:Night-1:Day -152 -225.6 -78.41 0: Supervisor 1 does better in day than night shifts (due to the negative difference Night - Day)
* 1:Swing-1:Day -104 -177.6 -30.41 0.005: Supervisor 1 does better in day than swing shifts
* 2:Night-2:Day 170 96.41 243.6 0: Supervisor 2 does better in night than day shifts
* 2:Swing-2:Day 115 41.41 188.6 0.002: Supervisor 2 does better in swing than day shifts
* ..

Key Takeaway: Compare the final columns of result tables (Pr(>F) or p adj) with the type I error rate.

If it is less than the type I error rate, then conclude there is a significant difference between the means (for ANOVA) or the means of a pair (for TukeyHSD).

The rest is deciding your linear model (lm):

* For one-factor: lm(value ~ Factor)
* For blocking: lm(value ~ Factor + Blocking)
* For two-factor: lm(value ~ Factor_1 + Factor_2 + Factor_1:Factor_2)
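That comparison can also be done programmatically. As a sketch (reusing the aov_model fitted above), the Pr(>F) column of the ANOVA table can be extracted and compared with \(\alpha\) directly:

```r
alpha <- 0.05

# summary() on an aov object returns a list containing the
# ANOVA table as a data frame; compare each Pr(>F) with alpha.
anova_table <- summary(aov_model)[[1]]
anova_table[["Pr(>F)"]] < alpha
```

A TRUE in the result marks a term whose effect is significant at the chosen type I error rate.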

Workshop 11 (18/10/22)

Simple Linear Regression

\[ S_{xy} = \sum(x_i - \bar{x})(y_i - \bar{y})\\ S_{xx} = \sum(x_i - \bar{x})^2 \]

df<-tibble(y = c(64,79,51,83,93,88,63,97,55,74), 
           x = c(39,43,22,63,56,47,27,74,34,53))
plot <- df %>% ggplot()+
  geom_point(aes(x = x, y = y))

x_bar <- mean(df$x)
y_bar <- mean(df$y)

S_xy <- sum((df$x - x_bar)*(df$y - y_bar))
S_xx <- sum((df$x - x_bar)^2)

beta_1 <- S_xy / S_xx
beta_0 <- y_bar - beta_1 * x_bar

plot + 
  geom_abline(slope = beta_1, intercept = beta_0)

model <- lm(y~ x, df)
summary(model) %>% pander()
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.01 8.052 4.348 0.00245
x 0.8665 0.1667 5.199 0.0008234
Fitting linear model: y ~ x
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
10 8.1 0.7716 0.7431
beta_0
## [1] 35.01287
beta_1
## [1] 0.8665312

Anova for Linear Regression

First, observe that without a linear model, the unexplained variance (the total sum of squares \(S_{yy}\)) is high:

S_yy <- sum((df$y - y_bar)^2)
S_yy
## [1] 2298.1

After performing linear regression on the x variable, we find that x is responsible for a large proportion of the variance. The unexplained (residual) sum of squares has dropped significantly, from 2298.1 to 524.8 (with a residual mean square of 65.6).

model%>%aov()%>%pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
x 1 1773 1773 27.03 0.0008234
Residuals 8 524.8 65.6

Coefficient of Determination

The coefficient of determination finds the proportion of variance explained by the regression model.

\[ R^2 = \dfrac{SSR}{SST} \]

SSR <- 1773 # Found from the anova table
SST <- S_yy
R_2 <- SSR / SST
R_2
## [1] 0.7715069
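As a quick check, this hand-computed value should agree with the \(R^2\) reported by summary(model) (0.7716 in the table above); the tiny discrepancy comes from SSR being rounded to 1773 in the ANOVA table:

```r
# R^2 straight from the fitted lm object, without rounding SSR
summary(model)$r.squared
```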

Testing the regression results

Check whether the error terms are normally distributed and scatter randomly around the line of best fit. For this we can use a residual plot and a QQ plot.

Residual Plot

df.model <- data.frame(fitted.values = model$fitted.values,
                       residuals = model$residuals)
df.model %>% ggplot(aes(x = fitted.values,y = residuals))+
  geom_point()+
  geom_smooth(method = "lm",se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Histogram and QQ Plot

df.model %>% 
  ggplot(aes(x = residuals)) +
  geom_histogram(aes(y = after_stat(density)))+  # density scale, so the normal curve overlays correctly
  stat_function(fun = dnorm, args = list(mean = 0, sd = sd(df.model$residuals)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

var(df.model$residuals)
## [1] 58.31451

We find that the histogram does not effectively show whether the error terms are normally distributed, so we use a QQ plot instead.

ggplot(df.model,aes(sample = residuals))+
  geom_qq_line()+
  geom_qq()

mean(df.model$residuals)
## [1] -4.458239e-17
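Beyond these visual checks, a formal normality test can be applied to the residuals. As a sketch (reusing df.model from above; shapiro.test is in base R's stats package):

```r
# Shapiro-Wilk test of normality on the residuals:
# H0 is that the residuals are normally distributed.
shapiro.test(df.model$residuals)
```

If the p-value exceeds \(\alpha = 0.05\), we fail to reject normality, consistent with what the QQ plot suggests.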

EPA Data: Overview

data("epa_data")

df <- epa_data %>% filter(year == 1990)
model<-lm(city~hwy,df)

beta_0 <- model$coefficients[1]
beta_1 <- model$coefficients[2]

plt <- df %>% 
  ggplot() +
  geom_point(aes(x = hwy, y = city)) +
  ggtitle("City Fuel Economy against Highway Fuel Economy") +
  xlab("Highway Fuel Economy (mpg)") +
  ylab("City Fuel Economy (mpg)")

plt + geom_abline(intercept = beta_0, slope = beta_1, color="red", 
                 linetype="solid", size=1)

interested_epa_data <- epa_data %>% filter(year >= 2011 & year <= 2021) 

epa_data_model <- lm(city ~ trans + disp + trans:disp, interested_epa_data)
epa_data_model%>%aov()%>%pander()
Analysis of Variance Model
  Df Sum Sq Mean Sq F value Pr(>F)
trans 1 3856 3856 228.9 2.96e-51
disp 1 230172 230172 13663 0
trans:disp 1 10.91 10.91 0.6475 0.421
Residuals 12461 209925 16.85
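Since the trans:disp interaction is not significant (Pr(>F) = 0.421 > 0.05), a reasonable next step is to drop it and refit the additive model. A sketch, reusing interested_epa_data from above:

```r
# Additive model without the non-significant interaction term;
# the main effects of trans and disp are retained.
additive_model <- lm(city ~ trans + disp, interested_epa_data)
additive_model %>% aov() %>% pander()
```

The main-effect conclusions should be essentially unchanged, since the interaction explained almost none of the variance.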

Workshop 12 (23/10/22)

Fisher’s exact test

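The worked pages for this section are in the official resources, but as a minimal sketch of the mechanics: fisher.test in base R takes a contingency table of counts (the 2x2 table below is made up purely for illustration):

```r
# Hypothetical 2x2 table: rows = group, columns = outcome.
# Fisher's exact test computes the exact p-value for independence,
# useful when expected counts are too small for a chi-squared test.
tbl <- matrix(c(8, 2,
                1, 5), nrow = 2, byrow = TRUE)
fisher.test(tbl)
```

As with the other tests, compare the reported p-value with the type I error rate \(\alpha\).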

Chi squared test

Words <- c("a", "an", "this", "that", "with", "without")
Admirer <- c(83,29,15,22,43,4)
Austen <- c(434,62,86,236,161,38)

df <- tibble(Words,Admirer,Austen)
df %>% ggplot(aes(x = Words, y = Admirer)) +
  geom_col()

df %>% ggplot(aes(x = Words, y = Austen)) +
  geom_col()
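The bar charts suggest the two authors use these words in different proportions. A chi-squared test of homogeneity on the 6x2 table of counts makes this formal (a sketch using base R's chisq.test):

```r
# Word counts for the six words in the Admirer text and Austen's work
Admirer <- c(83, 29, 15, 22, 43, 4)
Austen  <- c(434, 62, 86, 236, 161, 38)

# H0: the word-frequency distribution is the same for both authors.
chisq.test(cbind(Admirer, Austen))
```

A p-value below \(\alpha = 0.05\) would lead us to conclude the two authors' word usage differs.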
